Abstract



Model checking is a crucial part of any statistical analysis. As educators tie models
for testing to cognitive theory of the domains, there is a natural tendency to represent
participant proficiencies with latent variables representing the presence or absence of the
knowledge, skills, and proficiencies to be tested (Mislevy, Almond, Yan, & Steinberg, 2001).
Model checking for these models is not straightforward, mainly because traditional $\chi^2$-type
tests do not apply except for assessments with a small number of items. Williamson,
Mislevy, and Almond (2000) note a lack of published diagnostic tools for these models.

This paper suggests a number of graphics and statistics for diagnosing problems with
models with discrete proficiency variables. A small diagnostic assessment first analyzed
by Tatsuoka (1990) serves as a test bed for these tools. This work is a continuation of
the recent work by Yan, Mislevy, and Almond (2003) on this data set. Two diagnostic
tools that prove useful are Bayesian residual plots and an analog of the item characteristic
curve (ICC) plots. A $\chi^2$-type statistic based on the latter plot shows some promise, but
more work is required to establish the null distribution of the statistic. On the basis of the
problems identified with the model used by Mislevy (1995), the suggested diagnostics
help to hypothesize an improved model that seems to fit better.

Key words: Bayesian methods, Bayesian residual, item fit, Markov chain Monte Carlo, 
model fit, person fit, posterior predictive model checking 
1. Introduction



Model checking is a crucial part of any model-based statistical analysis, providing a 
vital sanity check that the theory underlying the model can actually predict the phenomena 
observed in the data. Model checking can identify individuals whose responses are not 
explained well by the model, and it can suggest improvements to the model and hence the 
underlying process that generated the data. Thus, model checking is an important part of 
the round trip between theory and empirical observation, which is the basis of the scientific
method. 

Model checking in educational testing presents special challenges because the part of the
model describing student proficiency almost always consists purely of latent variables. The
bulk of work to date, which is nowhere near completion, is based on unidimensional item
response theory (IRT) models where the proficiency model consists of a single continuous
latent trait (e.g., van der Linden & Hambleton, 1997).

The models with discrete proficiencies are of particular interest because of their ability
to capture expert opinion about the proficiencies used to solve assessment problems and
their interrelationships (Mislevy & Gitomer, 1996; Mislevy, Steinberg, & Almond, 2003).
However, there is a severe lack of well-established diagnostic tools (Williamson, Almond,
& Mislevy, 2000) for these models. The standard $\chi^2$-type test does not apply, except for
assessments with a small number of items.

This paper explores a number of approaches to assess the fit of models with student
proficiency consisting of discrete variables, particularly those in which the distribution
of the proficiency variables can be described using a Bayesian network. The paper then
applies these techniques to a data set from Tatsuoka’s (1990) research on mixed number
subtraction with middle school students. This work is an extension of the recent work on
model checking by Yan, Mislevy, & Almond (2003) on this data set. Appropriately created
Bayesian residual plots help us to improve upon a simple model with discrete proficiency
variables fit to the data set by Mislevy (1995). This paper suggests an “item fit plot,”
an equivalent of the standard item characteristic curve (ICC) applicable to models with
discrete proficiency variables, and an attached $\chi^2$-type test statistic. These plots and the
test statistics detect a number of problems with the model and flag two problematic items
in the test. The posterior predictive model checking method (Rubin, 1984; Gelman, Meng,
& Stern, 1996) is also applied to the model, but the discrepancy measures (which are
equivalent to the classical “test statistics”) used with the method do not seem to have
enough power to detect item fit or overall model fit in this example. However, the measures
appear to be promising in diagnosing person misfits.

The next section begins with a description of the mixed number subtraction problem, 
which motivates this work and will be used throughout the paper. The section then reviews 
the Almond and Mislevy (1999) framework for educational testing models and the particular 
use of Bayesian networks to model student proficiencies and item outcomes. The section 
introduces a data set from Tatsuoka’s (1990) work and then describes a specific model with 
discrete proficiency variables, called the two-parameter latent class model or “2LC” model 
hereafter, fit to the data set by Mislevy (1995). Finally, it gives a brief overview of Bayesian 
analysis and the Markov chain Monte Carlo (MCMC) algorithm, which is used for fitting 
all the models in this work. Section 3 reviews a number of approaches to model checking 
for these or related models. Section 4 starts by providing a summary of the method for 
fitting the 2LC model to the data set described in Section 2 and the results obtained.
The section then applies a number of diagnostic procedures, including item fit plots and
related $\chi^2$-type test statistics, to the 2LC model. The diagnostics indicate that the 2LC
model is inadequate to explain the variability in the data set. Section 5 introduces three 
new models, all involving discrete proficiency variables, that are possible improvements to 
the 2LC model, and applies the model diagnostics discussed earlier to those models. An 
extended version of the 2LC model, referred to as the “3LC” model here, seems to explain 
the data set satisfactorily. Section 6 discusses the performance of the diagnostics and makes 
recommendations for both practical application and future research. 

2. Background 

This section reviews some background material and introduces the small example 
assessment (based on mixed number subtraction), which we will analyze in later sections. 
The Mixed Number Subtraction Example



Increasingly, users of educational assessments want more than a single summary
statistic out of an assessment. They would like to see a profile of the state of acquisition of
a variety of knowledge, skills, and proficiencies for each learner. One technique for profile
scoring is the rule space method of Tatsuoka (1983). Rule space analysis starts with a
cognitive analysis of a number of tasks in a domain to determine the “attributes” that are
important for solving different kinds of problems. The experts then produce a Q-matrix, an
incidence matrix showing, for each item in an assessment, which attributes are required to
solve that item. To illustrate the rule space method, we introduce a running example used
throughout the paper: a test on mixed number subtraction.

This example is grounded in a cognitive analysis of middle school students’ solutions 
of mixed-number subtraction problems. Klein, Birnbaum, Standiford, & Tatsuoka (1981) 
identify two methods of solution for these problems: 

• Method A: Convert mixed numbers to improper fractions, subtract, then reduce if 
necessary. 

• Method B: Separate mixed numbers into whole number and fractional parts, subtract 
as two subproblems, borrowing one from the whole-number minuend if necessary, then 
simplify and reduce if necessary. 

We focus on students learning to use Method B (giving us 325 students). The cognitive 
analysis mapped out a flowchart for applying Method B to a universe of fraction subtraction 
problems. A number of key procedures appear, a subset of which are required to solve a 
given problem according to its structure. To simplify the model, we eliminate the items for 
which the fractions do not have a common denominator (leaving us with 15 items). The 
remaining procedures are as follows: 

• Skill 1: Basic fraction subtraction. 

• Skill 2: Simplify/reduce fraction or mixed number. 

• Skill 3: Separate whole number from fraction. 
• Skill 4: Borrow one from the whole number in a given mixed number.

• Skill 5: Convert a whole number to a fraction. 

Furthermore, the cognitive analysis identified Skill 3 as a prerequisite of Skill 4; that
is, there are no students who have Skill 4 but not Skill 3. Thus, there are only 24 possible
combinations of the five skills that a given student can possess.
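
The count of 24 can be checked by brute force. The following is a minimal sketch, not part of the original analysis; the function name `admissible_profiles` is ours. It enumerates all 32 binary skill profiles and drops the 8 that violate the prerequisite relation.

```python
from itertools import product

def admissible_profiles():
    """Return all 0/1 profiles (Skill 1, ..., Skill 5) in which
    Skill 4 is never present without Skill 3."""
    profiles = []
    for profile in product((0, 1), repeat=5):
        skill3, skill4 = profile[2], profile[3]
        if skill4 == 1 and skill3 == 0:
            continue  # excluded by the prerequisite relation
        profiles.append(profile)
    return profiles

profiles = admissible_profiles()
print(len(profiles))  # 24
```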

Table 1 lists 15 items from the data set collected by Tatsuoka (1990), characterized 
by the skills they require. The part of the table marked “Skills required” represents the 
Q-matrix. 



Table 1.

Skill Requirements for the Mixed Number Subtraction Problems

Item   Text of               Skills required        Evidence
no.    the item              1   2   3   4   5      model
 2     6/7 - 4/7             X                      1
 4     3/4 - 3/4             X                      1
 8     11/8 - 1/8            X   X                  2
 9     [illegible]           X       X              3
11     4 ?/? - 1 ?/?         X       X              3
 5     3 ?/? - 2             X       X              3
 1     3 1/2 - 2 ?/2         X       X   X          4
 7     4 1/3 - 2 ?/3         X       X   X          4
12     7 3/5 - 4/5           X       X   X          4
15     4 1/3 - 1 5/3         X       X   X          4
13     4 ?/10 - 2 ?/10       X       X   X          4
10     2 - [illegible]       X       X   X   X      5
 3     3 - 2 1/?             X       X   X   X      5
14     7 - 1 ?/?             X       X   X   X      5
 6     4 ?/? - 2 ?/?         X   X   X   X          6


A number of features of this data set can be learned by studying Table 1. First, note
that many rows of the Q-matrix are identical, corresponding to a group of items that
require the same set of skills to solve. Following the terminology of evidence-centered design
(Mislevy et al., 2003), we call the patterns corresponding to the rows evidence models.

Second, note that certain patterns of skills will be indistinguishable on the basis of
the results of this test (even assuming no chance errors). For example, because every item
requires Skill 1, the 12 profiles that lack Skill 1 are indistinguishable on the basis of these
data. Similar logic reveals that there are only nine equivalence classes of student profiles.
Table 2 describes the classes by relating them to the evidence models.

Table 2.

Skill Combinations for Each Equivalence Class

Equivalence   Class                    EM
class         description              1   2   3   4   5   6
1             No Skill 1
2             Only Skill 1             X
3             Skills 1 & 3             X       X
4             Skills 1, 3, & 4         X       X   X
5             Skills 1, 3, 4, & 5      X       X   X   X
6             Skills 1 & 2             X   X
7             Skills 1, 2, & 3         X   X   X
8             Skills 1, 2, 3, & 4      X   X   X   X       X
9             All skills               X   X   X   X   X   X



Often, distinctions among members of the same equivalence class are instructionally 
irrelevant. For example, students judged to be in Equivalence Class 1 would all be assigned 
remedial work in basic subtraction, so no further distinction is necessary. 
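
The nine equivalence classes can be recovered mechanically from the Q-matrix. The sketch below is illustrative only; the dictionary `EM_SKILLS` and the function names are ours, with the skill sets read off Table 1. It groups the 24 admissible profiles by which evidence models they master under a conjunctive rule.

```python
from itertools import product

# Skills required by each evidence model (EM), read off Table 1.
EM_SKILLS = {1: {1}, 2: {1, 2}, 3: {1, 3}, 4: {1, 3, 4},
             5: {1, 3, 4, 5}, 6: {1, 2, 3, 4}}

def mastery_pattern(profile):
    """0/1 flags for EM 1..6: 1 if the profile holds every required skill."""
    held = {k + 1 for k, has in enumerate(profile) if has}
    return tuple(int(EM_SKILLS[em] <= held) for em in sorted(EM_SKILLS))

def equivalence_classes():
    """Map each distinct mastery pattern to the profiles that produce it."""
    classes = {}
    for profile in product((0, 1), repeat=5):
        if profile[3] and not profile[2]:
            continue  # Skill 4 without Skill 3 is not admissible
        classes.setdefault(mastery_pattern(profile), []).append(profile)
    return classes

classes = equivalence_classes()
print(len(classes))                       # 9
print(len(classes[(0, 0, 0, 0, 0, 0)]))   # 12
```

The pattern with no evidence model mastered collects the 12 profiles lacking Skill 1, which corresponds to Equivalence Class 1 in Table 2.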

Tatsuoka (1990) analyzes this data set using her rule-space methodology (Tatsuoka, 
1983), which uses a pattern-matching approach to handle uncertainty. Mislevy (1995) later 
reanalyzed the data using Bayesian networks (Section 2). Instead, we will develop a more 
formal item response (IR) model for this problem. 
Evidence-Centered Design Framework



Mislevy (1995) recasts the mixed number subtraction example as an IR model. Unlike
the rule space model, where the error model is implicit in the distance metric used for matching,
we explicitly model the probability that a participant with attributes $\theta_i = \{\theta_{i1}, \ldots, \theta_{iK}\}$
will get a response vector $X_i$. Using an explicit error model makes it easy to check how
well it models the data and possibly suggests improvements to the model on the basis
of diagnostic statistics and graphs. Thus, we will follow this formulation of the example
throughout the paper.

Almond and Mislevy (1999) lay out a general formulation for educational testing
models, which forms the basis of evidence-centered design (Mislevy et al., 2003). The model
starts by postulating a number of proficiency variables, $\theta_i = \{\theta_{i1}, \ldots, \theta_{iK}\}$. These are
latent variables describing knowledge, skills, and abilities of the participant we wish to draw
inferences about. (Note that these are sometimes referred to as person parameters in the
IRT literature. However, as there is no difference in Bayesian statistics between unknown
parameters and latent variables, we use the term variable to emphasize its person-specific
nature.) The distribution of these variables, $P(\theta_i)$, is known as the proficiency model. In
the mixed number subtraction example, the proficiency model consists of the distribution
of five binary variables related to the presence or absence of the five skills.

Let $X_{ij}$ denote the scored outcome of the i-th participant to the j-th task. Note that
in general this can be a vector-valued quantity; that is, a single “task” could consist of
multiple “items.” Note also that these outcomes are scored, which implies some level of
processing from the raw response. In our example, each item produces a single dichotomous
outcome that is 1 if the response is correct and 0 if it is incorrect. One could imagine other
ways of processing the responses in this situation, for example, providing two outcomes:
one for whether the response was correct or not and one for whether the response was
reduced to the simplest form. Such a scheme might be more useful for producing diagnostic
feedback, but is not further considered here.

Next, we make two critical assumptions. The first is that all participants are
statistically independent. The second is that given the proficiency variables, the observed
outcomes for different tasks are independent. (In the case of several items that were
dependent, for example a reading testlet, we would group them into a single “task” to make
them independent of other tasks.) Under these assumptions, the joint distribution of the
proficiency variables and all of the outcome variables is:

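$$\prod_{i=1}^{I} \left[ P(\theta_i \mid \lambda) \prod_{j=1}^{J} P_j(X_{ij} \mid \theta_i, \pi_j) \right],$$

where $\pi_j$ denotes the parameters of the distribution of $X_{ij}$ given $\theta_i$, and $\lambda$ denotes the
parameters of $P(\theta_i)$. We call the term $P_j(X_{ij} \mid \theta_i, \pi_j)$ the link model because it provides
a link between the latent proficiency variables and the observable outcomes.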

If $P(\theta_i)$ and $P_j(X_{ij} \mid \theta_i, \pi_j)$ are known, then it is simple to compute the probability of a
proficiency profile for the participant. Applying Bayes theorem, we find $P(\theta_i \mid X_i)$ and
make diagnostic recommendations on the basis of this distribution.
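
Written out under the two independence assumptions above, and treating the proficiency model and link models as known, the Bayes theorem step takes the following form. This is our own worked statement rather than an equation from the original text; the sum in the denominator runs over all admissible proficiency profiles $\theta'$.

```latex
P(\theta_i \mid X_i)
  = \frac{P(\theta_i) \prod_{j=1}^{J} P_j(X_{ij} \mid \theta_i, \pi_j)}
         {\sum_{\theta'} P(\theta') \prod_{j=1}^{J} P_j(X_{ij} \mid \theta', \pi_j)}
```

Because the proficiency variables are discrete, the denominator is a finite sum, so the posterior over profiles can be computed exactly once the parameters are fixed.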

In the usual situation, however, $P(\theta_i)$ and $P_j(X_{ij} \mid \theta_i, \pi_j)$ are only known up to the
values of certain parameters, $\pi$ and $\lambda$. We make two assumptions about the independence
of the parameters. We assume that a priori $\lambda$ is independent of $\pi_j$, and $\pi_j$ and $\pi_{j'}$ are
independent for $j \neq j'$. In this case, our model becomes:

$$P(\lambda) \prod_{j=1}^{J} P_j(\pi_j) \prod_{i=1}^{I} \left[ P(\theta_i \mid \lambda) \prod_{j=1}^{J} P_j(X_{ij} \mid \theta_i, \pi_j) \right] \qquad (1)$$

Figure 1 shows this model graphically.

Note that while the link model for each item differs in its parameters, the functional
form is often the same. Table 1 identified six evidence models in the mixed number
subtraction data. The functional form differs across evidence models (each will rely on a
different number of $\theta_{ik}$s); however, it will be the same within a given evidence model. It
should be possible to exploit the evidence model structures to build hierarchical models for
the parameters, but we have not done this here.

Figure 1. The graphical representation of Equation 1.

Mixed Number Subtraction Proficiency Model

The Mislevy (1995) model for the mixed number subtraction problem follows exactly
the framework laid out above. It starts with five proficiency variables, $\{\theta_{i1}, \ldots, \theta_{i5}\}$,
corresponding to the five skills identified above. Each of these is an indicator variable,
which takes on the value 1 if the participant has mastered the skill and the value 0
otherwise. The prior (population) distribution $P(\theta \mid \lambda)$ is expressed as a discrete Bayesian
network or graphical model (Pearl, 1988; Lauritzen & Spiegelhalter, 1988). The Bayesian
network uses a graph to specify the factorization of the joint probability distribution over
the skills. Note that the Bayesian network entails certain conditional independence conditions,
which we can exploit when developing the Gibbs sampler for this model. Figure 2 shows
the dependence relationships among the skill parameters provided by the expert analysis
(primarily correlations, but Skill 1 is usually acquired before any of the others, so all of the
remaining skills are given conditional distributions given Skill 1). It corresponds to the
factorization

$$p(\theta) = p(\theta_3 \mid \theta_{WN})\, p(\theta_4 \mid \theta_{WN})\, p(\theta_{WN} \mid \theta_1, \theta_2, \theta_5)\, p(\theta_5 \mid \theta_1, \theta_2)\, p(\theta_2 \mid \theta_1)\, p(\theta_1).$$

Prior analyses revealed that Skill 3 is a prerequisite to Skill 4. A three-level auxiliary
variable $\theta_{WN}$ incorporates this constraint. Level 0 of $\theta_{WN}$ corresponds to the participants
who have mastered neither skill; Level 1 represents participants who have mastered Skill 3
but not Skill 4; Level 2 represents participants who mastered both skills. The relationships
between $\theta_{WN}$ and $\theta_3$ and $\theta_4$ are logical rather than probabilistic, but they can be represented
with probability tables with 1s and 0s.




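Figure 2. The graphical representation of the student model for the mixed number subtraction
example, $p(\theta) = p(\theta_3 \mid \theta_{WN})\, p(\theta_4 \mid \theta_{WN})\, p(\theta_{WN} \mid \theta_1, \theta_2, \theta_5)\, p(\theta_5 \mid \theta_1, \theta_2)\, p(\theta_2 \mid \theta_1)\, p(\theta_1)$.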

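The parameters $\lambda$ of the graphical model are defined as follows:

$\lambda_1 = P(\theta_1 = 1)$.

$\lambda_{2,m} = P(\theta_2 = 1 \mid \theta_1 = m)$ for $m = 0, 1$.

$\lambda_{5,m} = P(\theta_5 = 1 \mid \theta_1 + \theta_2 = m)$ for $m = 0, 1, 2$.

$\lambda_{WN,m,n} = P(\theta_{WN} = n \mid \theta_1 + \theta_2 + \theta_5 = m)$ for $m = 0, 1, 2, 3$ and $n = 0, 1, 2$.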

Finally, we require prior distributions $P(\lambda)$. We assume that $\lambda_1$, $\lambda_2$, $\lambda_5$, and $\lambda_{WN}$
are a priori independent. They will be a posteriori dependent because the $\theta$ variables are
latent (Madigan & York, 1991). However, the MCMC analysis will take that dependence
into account.

The natural conjugate priors for the components of $\lambda$ are either beta or Dirichlet
distributions. In all cases, we chose the hyper-parameters so that they sum to 27 (relatively
strong numbers given the sample size of 325). With such a complex latent structure, strong
priors such as the ones here are necessary to prevent problems with identifiability. These
must be supported by relatively expensive elicitation from the experts. Here, we have given
numbers that correspond to 87% for acquiring a skill when the previous skills are mastered
and 13% for acquiring the same skill when the previous skills are not mastered. They are
as follows:



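$\lambda_1 \sim \text{Beta}(23.5, 3.5)$
$\lambda_{2,0} \sim \text{Beta}(3.5, 23.5)$
$\lambda_{2,1} \sim \text{Beta}(23.5, 3.5)$
$\lambda_{5,0} \sim \text{Beta}(3.5, 23.5)$
$\lambda_{5,1} \sim \text{Beta}(13.5, 13.5)$
$\lambda_{5,2} \sim \text{Beta}(23.5, 3.5)$
$\lambda_{WN,0,\cdot} = (\lambda_{WN,0,0}, \lambda_{WN,0,1}, \lambda_{WN,0,2}) \sim \text{Dirichlet}(15, 7, 5)$
$\lambda_{WN,1,\cdot} \sim \text{Dirichlet}(11, 9, 7)$
$\lambda_{WN,2,\cdot} \sim \text{Dirichlet}(7, 9, 11)$
$\lambda_{WN,3,\cdot} \sim \text{Dirichlet}(5, 7, 15)$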
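
As a quick check on the 87% and 13% figures, the prior means implied by these hyper-parameters can be computed directly. This is a small sketch of our own; the helper names `beta_mean` and `dirichlet_mean` are illustrative.

```python
def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

def dirichlet_mean(alpha):
    """Mean vector of a Dirichlet distribution with parameters alpha."""
    total = sum(alpha)
    return [a / total for a in alpha]

print(round(beta_mean(23.5, 3.5), 3))   # 0.87
print(round(beta_mean(3.5, 23.5), 3))   # 0.13
print([round(m, 3) for m in dirichlet_mean([15, 7, 5])])  # [0.556, 0.259, 0.185]
```

Each set of hyper-parameters sums to 27, so every prior carries the weight of 27 pseudo-observations against the sample of 325 examinees.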



Haertel and Wiley (1993) note that whenever the proficiency model consists of binary
skills, it implicitly induces a number of latent classes. In this example, there are 24 values
of $\theta$ that have non-zero prior probability. The graphical model $p(\theta \mid \lambda)$, described above,
is a compact and structured way of representing the prior probability over those latent
classes. Although this distribution is over all 24 possible latent classes, only 9 of them are
identifiable from the data (Table 2). This property of the test design will manifest itself
later in the analysis.
Mixed Number Subtraction Link Models

The model implicit in Table 1 is a conjunctive skills model; that is, a participant 
needs to have mastered all of the skills shown in the appropriate row in order to solve the 
problem. If the participant has mastered all of the skills necessary to solve a particular item 
(or item from an evidence model), we say that student has mastered the item (evidence 
model). In general, students will not behave according to the ideal model; we will get false 
positive and false negative results. 

We now build a series of link models that follow that intuition. The 2LC model uses
two parameters per link model: the true positive and false positive probabilities. That is:

$$P(X_{ij} = 1 \mid \theta_i, \pi_j) = \begin{cases} \pi_{j1} & \text{if Examinee } i \text{ mastered all the skills needed to solve Item } j, \\ \pi_{j0} & \text{otherwise.} \end{cases} \qquad (2)$$

Suppose the j-th item uses the evidence model $s$, $s = 1, 2, \ldots, 6$. Although $s$ is
determined by the item, this notation does not reflect that. Let $\delta_{i(s)}$ be the 0/1 indicator
denoting whether Examinee $i$ has mastered the skills needed for tasks using Evidence
Model $s$. Note that the $\delta_{i(s)}$s for any examinee are completely determined by the values of
$\theta_1, \theta_2, \ldots, \theta_5$ for that examinee. The likelihood of the response of the i-th examinee to the
j-th item is then taken as

$$X_{ij} \mid \pi_j, \theta_i \sim \text{Bernoulli}(\pi_{j\delta_{i(s)}}). \qquad (3)$$

The local independence assumption is made; that is, given the proficiency, the responses
of an examinee to the different items are assumed independent. The probability $\pi_{j1}$
represents a “true-positive” probability for the item; that is, it is the probability of getting
the item right for students who have mastered all of the required skills. The probability
$\pi_{j0}$ represents a “false-positive” probability; it is the probability of getting the item right
for students who have yet to master at least one of the required skills. The probabilities
$\pi_{j0}$ and $\pi_{j1}$ are allowed to differ over $j$ (i.e., from item to item), both within and across
evidence models. However, we use the same priors for all items:

$$\pi_{j0} \sim \text{Beta}(3.5, 23.5), \qquad \pi_{j1} \sim \text{Beta}(23.5, 3.5). \qquad (4)$$

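
To make the 2LC link model concrete, the sketch below evaluates the log-likelihood of one examinee's responses given a skill profile. It is illustrative only: the names `EM_SKILLS`, `masters`, and `log_likelihood` are ours, the skill sets are read off Table 1, and the parameter values 0.13 and 0.87 are simply the prior means, not fitted estimates.

```python
import math

# Skills required by each evidence model (EM), read off Table 1.
EM_SKILLS = {1: {1}, 2: {1, 2}, 3: {1, 3}, 4: {1, 3, 4},
             5: {1, 3, 4, 5}, 6: {1, 2, 3, 4}}

def masters(profile, em):
    """True if the 0/1 profile (Skill 1..5) holds every skill the EM requires."""
    held = {k + 1 for k, has in enumerate(profile) if has}
    return EM_SKILLS[em] <= held

def log_likelihood(responses, item_ems, profile, pi0, pi1):
    """Log-likelihood of 0/1 responses under the two-parameter link model:
    success probability pi1[j] for mastered items, pi0[j] otherwise."""
    total = 0.0
    for x, em, p0, p1 in zip(responses, item_ems, pi0, pi1):
        p = p1 if masters(profile, em) else p0
        total += math.log(p if x == 1 else 1.0 - p)
    return total

# Three hypothetical items, one each from EM 1, EM 3, and EM 4.
item_ems = [1, 3, 4]
pi0, pi1 = [0.13] * 3, [0.87] * 3
profile = (1, 0, 1, 0, 0)          # has Skills 1 and 3 only
ll = log_likelihood([1, 1, 0], item_ems, profile, pi0, pi1)
print(round(math.exp(ll), 4))      # 0.6585
```

The examinee masters the first two items and not the third, so the probability of the pattern (1, 1, 0) is 0.87 × 0.87 × (1 − 0.13).
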
This model is very similar to the “noisy-and” model discussed in Pearl (1988) and
Junker and Sijtsma (2000) and also to the fusion model of Hartz, Roussos, and Stout (2002).
However, both the noisy-and and fusion models include additional terms for modeling the
effect of missing each of the individual skills. Thus, their models are somewhat softer than
the mastered/not-mastered approach of the 2LC model described here. The fusion model
also includes an additional continuous proficiency model variable for proficiency to apply
the skills to solve the problem. In the later sections we will soften the 2LC model in several
ways.

Bayesian Analysis and Markov Chain Monte Carlo Algorithm 

Although a substantial amount of prior information about the values of the parameters
of the 2LC model ($\lambda$ and $\pi$) is available, we still would like to refine that knowledge from
data. In particular, our interest is in the posterior distributions of the model
parameters given the observed test outcomes. Knowledge of the whole distribution is
essential in order to be able to properly criticize the model. However, simply applying
Bayes theorem to learn about the posterior distribution (which is the first step in a typical
Bayesian analysis) leaves us with an integral that is impossible to compute analytically.
The fact that all of the proficiency variables are latent and hence missing causes many of
the convenient independence properties of our model to disappear.

Imputing values for the latent variables and parameters allows us to exploit those 
independence conditions. In particular, the MCMC simulation repeatedly samples from the 
distributions of the latent variables and parameters using a Markov chain whose stationary 
distribution is equal to the posterior distribution. As long as the Markov process is run long 
enough so that the distribution of the draws is close enough to the stationary distribution, 
we can calculate quantities of interest using Monte Carlo integration with this sample. 
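
The alternation between imputing latent variables and updating parameters can be illustrated on a deliberately reduced problem. The sketch below is not the sampler used in this work (which is built by BUGS for the full five-skill model); it is a toy Gibbs sampler of our own for a hypothetical one-skill version of the 2LC model, using the same Beta hyper-parameters and only the Python standard library.

```python
import math
import random

def gibbs_one_skill(X, n_iter=200, seed=0):
    """Toy Gibbs sampler: theta_i ~ Bernoulli(lam), and item j is answered
    correctly with probability pi1[j] if theta_i = 1, else pi0[j].
    X is a list of 0/1 response rows. Returns the draws of lam."""
    rng = random.Random(seed)
    n, J = len(X), len(X[0])
    lam, pi0, pi1 = 0.5, [0.13] * J, [0.87] * J
    lam_draws = []
    for _ in range(n_iter):
        # Step 1: impute each latent mastery indicator given the parameters.
        theta = []
        for row in X:
            log1 = math.log(lam) + sum(
                math.log(p if x else 1 - p) for x, p in zip(row, pi1))
            log0 = math.log(1 - lam) + sum(
                math.log(p if x else 1 - p) for x, p in zip(row, pi0))
            prob1 = 1.0 / (1.0 + math.exp(log0 - log1))
            theta.append(1 if rng.random() < prob1 else 0)
        # Step 2: conjugate Beta update for the mastery prevalence.
        m = sum(theta)
        lam = rng.betavariate(23.5 + m, 3.5 + n - m)
        # Step 3: conjugate Beta updates for the item parameters.
        for j in range(J):
            right1 = sum(row[j] for row, t in zip(X, theta) if t == 1)
            right0 = sum(row[j] for row, t in zip(X, theta) if t == 0)
            pi1[j] = rng.betavariate(23.5 + right1, 3.5 + m - right1)
            pi0[j] = rng.betavariate(3.5 + right0, 23.5 + (n - m) - right0)
        lam_draws.append(lam)
    return lam_draws

X = [[1, 1, 1], [1, 1, 0], [0, 0, 0], [1, 1, 1]]
draws = gibbs_one_skill(X, n_iter=100)
print(sum(draws) / len(draws))  # Monte Carlo estimate of the posterior mean of lam
```

Once the latent indicators are imputed, every parameter update is a standard conjugate draw, which is the independence structure the imputation step is meant to restore. A real analysis would discard an initial burn-in portion of the chain and check convergence before summarizing.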

The Gibbs sampler and the Metropolis-Hastings algorithm (see, e.g., Gelman et al.,
1995) are two of the most common MCMC algorithms. A number of books, such as
Gelman et al. (1995), give a detailed discussion of the MCMC methods. We use the BUGS
software (Spiegelhalter, Thomas, Best, & Gilks, 1995), which builds a Gibbs sampler (or,
if necessary, a Metropolis-Hastings algorithm) based on the description of the problem
(implicitly the model graph). The Gibbs sampling algorithm can exploit the conditional
independence implicit in the graphs in Figures 1 and 2. Mislevy, Senturk, et al. (2001)
and Yan et al. (2003) describe their applications using the algorithm for the mixed number
subtraction example.

3. Diagnostics for Models With Discrete Proficiency Variables 

Although the model described in the previous section has a quite complex latent
structure, it is also a good reflection of the cognitive theory of the domain as expressed by
Klein et al. (1981). Therefore, studying model fit and attempting model improvement will
help us refine not only the measurement properties of the model but also the underlying
cognitive model.

In evaluating the fit of educational assessment models, an investigator can look at three 
kinds of model fit tests: 

• tests of global fit, indicating the overall fit of data to the model 

• tests of item fit, identifying assessment items that the model does not predict well

• tests of person fit, identifying participants whose response patterns are not predicted 
well by the model. 

Many diagnostic tests are based on analysis of residuals, that is, the differences between the
predicted and observed values. We perform Bayesian model fitting in this work, so the
standard residual plots do not apply directly. This section first discusses how a Bayesian
version of the classical residual plot (Chaloner & Brant, 1988) may be useful for our
problem. Then this section reviews a number of plotting techniques aimed at analyzing
item performance. Because we fit the model using an MCMC algorithm, we have access
to the generated values of all the parameters and variables of the model. As a result,
we can apply some additional model fit techniques. One of them is the posterior
predictive model checking method, which is also discussed in this section.
Bayesian Residual Analysis



In linear models, analysis of residuals has proved to be a robust tool for diagnosing
problems with the model fit. Because the model described above is a full Bayesian model,
we use the Bayesian approach to residual analysis (Chaloner & Brant, 1988). Suppose
$X_i$ denotes the observation for individual $i$. Suppose further that $E(X_i \mid \omega) = E_i$, where
$\omega = (\omega_1, \omega_2, \ldots, \omega_m)$ denotes the vector of all parameters in the model. Consider the realized
residual $\epsilon_i = X_i - E_i$. After the data have been collected and a Bayesian model has been fitted,
$X_i$ is considered outlying if the posterior distribution for the residual is located far from
zero (Chaloner & Brant, 1988).

For the mixed number subtraction data, all the individual observations are binary. The 
residuals from binary response models are difficult to define and interpret (Albert & Chib, 
1995), mainly because the distribution of an individual observation is far from the normal 
distribution. Therefore, in this work, we look at residuals after some pooling, which creates 
more meaningful and more stable residuals. 

Let $O_i$ denote the raw score (or number-correct score) for Examinee $i$. The raw score
is a very natural quantity to examine in the analysis of any test data: classical test
theory revolves around the raw score, and the raw score plays an important role in
IRT as well (e.g., van der Linden and Hambleton, 1997). The mixed number subtraction
link model in (2) provides the basis for computing the expectation of the raw score of an
examinee conditional on the parameters of the model, resulting in an expression for the
realized residuals $\epsilon_i$. Examining the posterior distribution of the realized residuals may
provide some insights about the fit of the model. The idea is described in more detail
in the next section, where we apply it to the mixed number data.
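
One way to read this construction is sketched below. It assumes, as our interpretation of the link model in (2), that the expected raw score in a given MCMC draw is the sum over items of the success probability that applies to the examinee in that draw. The function names and the layout of a "draw" are hypothetical.

```python
def expected_raw_score(mastered_em, item_ems, pi0, pi1):
    """E(O_i | omega): sum over items of pi1[j] if the examinee masters the
    item's evidence model in this draw, else pi0[j].
    mastered_em maps an evidence-model number to 0/1."""
    return sum(p1 if mastered_em[em] else p0
               for em, p0, p1 in zip(item_ems, pi0, pi1))

def realized_residuals(raw_score, draws, item_ems):
    """Posterior sample of the realized residual O_i - E_i for one examinee.
    draws: one (mastered_em, pi0, pi1) tuple per MCMC iteration (assumed layout)."""
    return [raw_score - expected_raw_score(mastered, item_ems, p0, p1)
            for mastered, p0, p1 in draws]

# One hypothetical draw for a three-item test (EM 1, EM 3, EM 4).
item_ems = [1, 3, 4]
draws = [({1: 1, 3: 1, 4: 0}, [0.13] * 3, [0.87] * 3)]
print(realized_residuals(2, draws, item_ems))  # approximately [0.13]
```

With many draws, the resulting list approximates the posterior distribution of the residual; an examinee is flagged when that distribution sits far from zero.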

Item Fit Plots 

For IRT models, one way to detect lack of item fit is to compare the average item
performance levels of various proficiency groups to the performance levels predicted by the
model (see, for example, Hambleton and Swaminathan, 1985, and Hambleton, 1989). The
comparison is made mostly by plotting an item characteristic curve (ICC), which shows
the observed vs. predicted proportion correct scores for the various proficiency groups; a
$\chi^2$-type test statistic is also used to make the comparison. Yen (1981) used groups based on
the likelihood estimates of proficiency of the examinees, while Orlando and Thissen (2000)
formed the groups based on the raw scores of the examinees. An ICC is plotted for each
item in a test to assess the fit of that item. Too many misfitting items indicate a problem
with the fit of the model to the data; on the other hand, very few misfitting items usually
indicate that those particular items are outliers in the sense that they cannot be explained
by the model.

When extending the IRT item fit ideas to models with discrete proficiency variables, the
first task is to identify the groups comparable to the proficiency groups used in IRT. The
equivalence classes of states of the proficiency variables (see Table 2) form natural groups,
but they rely on the state of unobserved variables. However, because we use the MCMC
algorithm to fit the model, there is a way to form the groups and hence to judge item fit.
Looking at the draws of the proficiency variables for the individuals in each iteration of the
Markov chain, we can classify the individuals into different groups according to different
combinations of the proficiency variable values. By comparing the observed proportion
correct for an item to the expected proportion correct for each group of individuals, we may
get an idea about the fit of the item. As with IRT models, this can be done graphically or
by using a $\chi^2$-type test statistic.

The measure may have low power as it depends on unobserved quantities, but it provides
us with one way to assess item fit, and it proves to be useful in the real data example later.
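
The grouping step can be sketched for a single MCMC draw as follows. This is an illustration under stated assumptions, not the statistic defined later in the paper: the function name `item_fit_for_draw` is ours, and the Pearson-type form of the summed discrepancy is an assumed stand-in for the $\chi^2$-type statistic.

```python
from collections import defaultdict

def item_fit_for_draw(responses_j, classes, masters_by_class, pi0_j, pi1_j):
    """Compare observed and expected proportion correct on item j by group.
    responses_j: 0/1 responses to item j, one per examinee.
    classes: equivalence class drawn for each examinee in this iteration.
    masters_by_class: class label -> True if that class masters item j.
    Returns (rows, chi2); rows hold (class, n, observed, expected)."""
    counts = defaultdict(lambda: [0, 0])  # class -> [n examinees, n correct]
    for x, c in zip(responses_j, classes):
        counts[c][0] += 1
        counts[c][1] += x
    rows, chi2 = [], 0.0
    for c in sorted(counts):
        n, right = counts[c]
        observed = right / n
        expected = pi1_j if masters_by_class[c] else pi0_j
        # Assumed Pearson-type contribution for a binomial proportion.
        chi2 += n * (observed - expected) ** 2 / (expected * (1 - expected))
        rows.append((c, n, observed, expected))
    return rows, chi2

rows, chi2 = item_fit_for_draw([1, 1, 0, 0], [2, 2, 1, 1],
                               {1: False, 2: True}, 0.25, 0.75)
print(rows)            # [(1, 2, 0.0, 0.25), (2, 2, 1.0, 0.75)]
print(round(chi2, 4))  # 1.3333
```

Plotting the observed against the expected proportions from `rows`, averaged or overlaid across iterations, gives the item fit plot; the group sizes change from draw to draw because class membership is itself sampled.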

Posterior Predictive Model Checking 

Let $y$ represent all of the observed variables in our model and let $\omega$ represent all of the
parameters and unobserved (proficiency) variables. Let $y^{rep}$ denote replicate data that we
might observe if the experiment that generated $y$ is replicated with the same value of $\omega$
that generated the observed data. Since the value of $\omega$ that generated the observed data is
unknown, we derive the posterior (given $y$) predictive distribution of $y^{rep}$ by averaging over
the plausible values of $\omega$, given by the posterior distribution $p(\omega \mid y)$,

$$p(y^{rep} \mid y) = \int p(y^{rep} \mid \omega)\, p(\omega \mid y)\, d\omega.$$

Guttman (1967) applies the posterior predictive distribution in a goodness-of-fit test. Rubin
(1984) suggests simulating replicate data sets from the posterior predictive distribution for 
model checking. Any significant difference between the replications and the observed data 
indicates a possible failure of the model. 

In practice, for a given diagnostic measure, $D(y)$, any significant difference between
the observed value $D(y)$ and the reference distribution of $D(y^{rep})$ indicates a possible
model failure. Gelman et al. (1996) extend the posterior predictive approach to use
diagnostic measures $D(y, \omega)$ that depend on the data and the parameters. The divergence
of the data from the posterior predictive distribution can be determined by comparing the
posterior predictive distribution of $D(y^{rep}, \omega)$ with the posterior distribution of $D(y, \omega)$.
The comparison can be carried out easily by simulation. We draw $N$ simulations
$\omega^1, \ldots, \omega^N$ from the posterior distribution of $\omega$, and then draw one $y^{rep}$ from the predictive
distribution $p(y^{rep} \mid \omega)$ using each simulated $\omega$. We then have $N$ draws from the joint
posterior distribution $p(y^{rep}, \omega \mid y)$. The posterior predictive check boils down to comparing
the values of the realized discrepancy $D(y, \omega^n)$ and the replicated discrepancy measures
$D(y^{rep,n}, \omega^n)$, $n = 1, 2, \ldots, N$, perhaps by plotting the pairs $(D(y, \omega^n), D(y^{rep,n}, \omega^n))$ in
a scatter-plot. One popular summary of the comparison is the tail-area probability or
Bayesian p-value,

$$p_b = P(D(y, \omega) > D(y^{rep}, \omega) \mid y) = \int\!\!\int I_{[D(y,\omega) > D(y^{rep},\omega)]}\, p(\omega \mid y)\, p(y^{rep} \mid \omega)\, d\omega\, dy^{rep},$$

where $I_{[A]}$ is the indicator function for the event $A$. The p-value is estimated from the
simulations as the proportion of the $N$ replications for which $D(y, \omega^n) > D(y^{rep,n}, \omega^n)$.

Very extreme posterior predictive p-values (close to 0 or 1) indicate model misfit.
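The simulation recipe above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the BUGS code used in this work: the toy model has a single Bernoulli success probability $\omega$ with a conjugate Beta posterior, the discrepancy is a squared standardized residual of the number correct, and the names `posterior_predictive_pvalue`, `discrepancy`, and `simulate` are hypothetical.

```python
import numpy as np

def posterior_predictive_pvalue(y, omega_draws, discrepancy, simulate, rng):
    """Estimate P(D(y, omega) > D(y_rep, omega) | y) from posterior draws."""
    exceed = 0
    for omega in omega_draws:
        y_rep = simulate(omega, rng)  # one replicate per posterior draw
        if discrepancy(y, omega) > discrepancy(y_rep, omega):
            exceed += 1
    return exceed / len(omega_draws)

# Toy data (assumption for illustration): 50 Bernoulli responses.
rng = np.random.default_rng(0)
y = rng.binomial(1, 0.7, size=50)

# A Beta(1, 1) prior gives a Beta posterior for omega, so we can draw directly.
omega_draws = rng.beta(1 + y.sum(), 1 + len(y) - y.sum(), size=2000)

def discrepancy(data, omega):
    # Squared standardized residual of the number correct.
    n = len(data)
    return (data.sum() - n * omega) ** 2 / (n * omega * (1 - omega))

def simulate(omega, rng):
    # Replicate data of the same size as the observed data.
    return rng.binomial(1, omega, size=len(y))

p_b = posterior_predictive_pvalue(y, omega_draws, discrepancy, simulate, rng)
```

Because the toy data are generated from the assumed model, `p_b` should not be extreme here; a value near 0 or 1 would be the signal of misfit.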

Posterior predictive checks have been criticized for being conservative (see, for example,
Bayarri & Berger, 2000, and Sinharay & Stern, 2003). Still, they are easy to carry out and
interpret. They are especially useful if we think of the current model as a plausible ending
point with modifications to be made only if substantial lack of fit is found. Successful
applications of the technique in psychometrics include Johnson, Cohen, and Junker (1999),
Hoijtink and Molenaar (1997), Mislevy, Senturk, et al. (2001), and Sinharay and Johnson
(2003) and the references therein.

For this work, we use as discrepancy measures the sum of squares of standardized
residuals over the persons (to detect item fit) or over the items (to detect person fit) and
proportion correct for items and persons.



4. The Diagnostics Applied to the 2LC Model 

We apply the different diagnostics discussed in Section 3 to the mixed number 
subtraction example in an attempt to find out whether the 2LC model adequately explains 
the variability in the data set. The first part of this section briefly describes fitting
the 2LC model using an MCMC algorithm. We then examine Bayesian residual plots for 
the 2LC model. What follows is an analog of ICC plots for assessing item fit. We then 
attempt to build a fit statistic to identify the items with problems. Then we look at the 
posterior predictive model check statistics. Finally, we summarize what the diagnostics 
applied tell us about the 2LC model. 



Fitting the Model Using MCMC Algorithm 



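Mislevy, Almond, et al. (2001) describe fitting the 2LC model to the data set using the
MCMC algorithm. The joint posterior distribution of the parameters of the model given
the data $X$ is given by

$$p(\theta, \lambda, \pi \mid X) \propto \left\{ \prod_i \prod_j p(X_{ij} \mid \theta_i, \pi_j) \right\} p(\theta \mid \lambda)\, p(\pi)\, p(\lambda).$$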



As mentioned before, we use the BUGS program (Spiegelhalter et al., 1995) to run the
MCMC algorithm that fits the 2LC model to the data.

The BUGS program is used to generate five chains of size 3,000 with dispersed starting
values. Looking at the plot of the Gelman-Rubin diagnostic measure (e.g., Gelman et al.,
1995), we find that convergence is achieved within a few hundred iterations. We retain
the last 2,000 values in each chain to obtain a total posterior sample of size 10,000. The
posterior summary (the posterior mean, sd, and three quantiles: 2.5%, median, and 97.5%)
of a few parameters is given in Table 3.
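As a rough illustration of the convergence check and of how the retained draws are pooled, the following sketch computes a basic (non-split) potential scale reduction factor for one scalar parameter. It is an assumption-laden stand-in, not the diagnostic plot produced by BUGS: the function name `gelman_rubin` and the simulated chains are hypothetical.

```python
import numpy as np

def gelman_rubin(chains):
    """Potential scale reduction factor for one scalar parameter.

    chains: array of shape (m, n) holding m chains of n retained draws each.
    """
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    within = chains.var(axis=1, ddof=1).mean()      # W: mean within-chain variance
    between = n * chain_means.var(ddof=1)           # B: between-chain variance
    var_plus = (n - 1) / n * within + between / n   # pooled variance estimate
    return float(np.sqrt(var_plus / within))

# Hypothetical draws: five chains of 3,000 iterations each.
rng = np.random.default_rng(1)
raw = rng.normal(loc=0.82, scale=0.02, size=(5, 3000))

kept = raw[:, -2000:]                 # retain the last 2,000 values of each chain
r_hat = gelman_rubin(kept)            # values near 1 suggest convergence
posterior_sample = kept.reshape(-1)   # 5 x 2,000 = 10,000 pooled draws
```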



Table 3.

Posterior Summary of a Few Parameters for the 2LC Model

                                                                          Quantiles
Parameter         Interpretation of the parameter           Mean   sd     2.5%   Median   97.5%
$\lambda_1$       $P(\theta_1 = 1)$                         0.82   0.02   0.78   0.83     0.87
$\lambda_{2,0}$   $P(\theta_2 = 1 \mid \theta_1 = 0)$       0.13   0.06   0.03   0.12     0.26
$\lambda_{2,1}$   $P(\theta_2 = 1 \mid \theta_1 = 1)$       0.91   0.03   0.85   0.91     0.96
$\pi_{1,0}$       $P(X_1 = 1 \mid \delta_{i(4)} = 0)$       0.08   0.02   0.04   0.08     0.12
$\pi_{1,1}$       $P(X_1 = 1 \mid \delta_{i(4)} = 1)$       0.88   0.03   0.82   0.88     0.93
$\pi_{4,0}$       $P(X_4 = 1 \mid \delta_{i(1)} = 0)$       0.32   0.05   0.21   0.32     0.42
$\pi_{4,1}$       $P(X_4 = 1 \mid \delta_{i(1)} = 1)$       0.78   0.02   0.73   0.78     0.83
$\pi_{15,0}$      $P(X_{15} = 1 \mid \delta_{i(4)} = 0)$    0.03   0.01   0.01   0.03     0.05
$\pi_{15,1}$      $P(X_{15} = 1 \mid \delta_{i(4)} = 1)$    0.82   0.03   0.76   0.83     0.89




The posterior mean of 0.82 for $\lambda_1$ suggests that, on average, 82% of the students have
Skill 1 (basic fraction subtraction). Note that this is a key skill in the sense that all 15
items in this test require the presence of this skill to solve the item successfully. We also
notice that of those who do not have Skill 1, about 13% have Skill 2, and approximately
91% of those having Skill 1 have Skill 2. Note that among the $\lambda$s shown in the table, the
posterior sd is largest for $\lambda_{2,0}$, indicating that the fewest students have
this particular combination. A closer look reveals that the posterior distribution of $\lambda_{2,0}$ is
very close to the Beta(3.5, 23.5) prior distribution (the mean and sd of the prior
distribution are 0.13 and 0.06, the same as the corresponding posterior quantities). This is
because there are no tasks in the test that assess the presence of Skill 2 in the absence of
Skill 1 (this can be verified in Table 1), making it impossible to distinguish between those
two sets of latent classes. This should not make a practical difference, as once we have
assessed that a student lacks Skill 1, basic fraction subtraction, we would assign remedial
exercises to address this lack and then reassess for the presence of Skill 2.
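The prior mean and sd quoted above follow from the standard moments of a Beta($a$, $b$) distribution with $a = 3.5$ and $b = 23.5$:

$$\text{mean} = \frac{a}{a+b} = \frac{3.5}{27} \approx 0.13, \qquad \text{sd} = \sqrt{\frac{ab}{(a+b)^2 (a+b+1)}} = \sqrt{\frac{82.25}{27^2 \times 28}} \approx \sqrt{0.00403} \approx 0.06.$$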

From the summary of the $\pi$s, we see, for example, that for the individuals who do
not have the necessary skills for solving Item 4 (which requires Skill 1 only), the chance
of getting the item correct is about 32%, whereas someone having Skill 1 will have a 78%
chance of solving that item correctly.

Bayesian Residual Plots 

As mentioned earlier, this paper examines residuals based on the number correct scores
of the examinees. Let $O_i$ denote the observed number correct score of Examinee $i$. For this
data set, $O_i$ ranges from 0 to 15.

Suppose we know the parameters of the model. Using $\theta_{i1}, \theta_{i2}, \ldots, \theta_{i5}$, the values of
the proficiency variables, it is possible to compute the expected number correct score of
Examinee $i$ as

$$E(O_i \mid \theta_i, \pi, \lambda) = E_i = \sum_{j \in \text{items}} \pi_{j,\delta_{i(s)}}, \qquad (5)$$

where $\delta_{i(s)}$ is the indicator for mastery of the skills in the Evidence Model $s$ for Item $j$. The
value of $\delta_{i(s)}$ is determined by the values of $\theta_{i1}, \theta_{i2}, \ldots, \theta_{i5}$ for a particular examinee. Then, as
in Chaloner and Brant (1988), define the realized residual for Examinee $i$ as

$$R_i = O_i - E_i.$$

An examination of the posterior distribution of the $R_i$s may be a useful tool in this context.
Although we don't know the values of the parameters or of the latent variables, we have
the draws from the posterior distribution (obtained by the MCMC algorithm) of the
parameters given the data. Therefore, we can compute, for each examinee, values of $R_i$ for
each iteration of the MCMC algorithm, and then a 95% posterior credible interval for $R_i$
(formed by the 2.5th and 97.5th percentiles).
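The computation just described can be sketched as follows. This is a minimal illustration with made-up inputs, assuming the MCMC output has already been arranged into arrays; the function names `expected_score` and `residual_intervals` are hypothetical.

```python
import numpy as np

def expected_score(pi, delta):
    """Expected number correct E_i for one MCMC iteration.

    pi:    (n_items, 2) array with columns pi_j0 and pi_j1.
    delta: (n_examinees, n_items) integer array of 0/1 mastery indicators.
    """
    items = np.arange(pi.shape[0])
    # pi[items, delta][i, j] picks pi_{j, delta_ij}; summing over j gives E_i.
    return pi[items, delta].sum(axis=1)

def residual_intervals(observed, expected_draws):
    """observed: (n_examinees,) scores O_i.
    expected_draws: (n_iter, n_examinees) values of E_i, one row per iteration.
    Returns posterior mean of E_i, posterior mean of R_i, and 95% limits for R_i.
    """
    residual_draws = observed[None, :] - expected_draws   # R_i = O_i - E_i
    lower, upper = np.percentile(residual_draws, [2.5, 97.5], axis=0)
    return expected_draws.mean(axis=0), residual_draws.mean(axis=0), lower, upper

# Hypothetical single-iteration check of expected_score: 2 items, 2 examinees.
pi = np.array([[0.1, 0.9], [0.2, 0.8]])
delta = np.array([[0, 1], [1, 1]])
e_one_iter = expected_score(pi, delta)

# Hypothetical E_i draws for 3 examinees over 4 saved iterations.
observed = np.array([3.0, 9.0, 14.0])
expected_draws = np.array([
    [4.0, 9.5, 12.0],
    [5.0, 8.5, 13.0],
    [4.5, 9.0, 12.5],
    [4.5, 9.0, 12.5],
])
fitted, mean_resid, lower, upper = residual_intervals(observed, expected_draws)
```

In a residual plot of this kind, `fitted` would supply the horizontal coordinates, `mean_resid` the dots, and `lower` and `upper` the ends of the vertical lines.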

For each examinee, a vertical line in Figure 3 plots the 95% posterior credible interval
for the residual ($R_i$). The horizontal axis corresponds to the posterior mean of the
corresponding $E_i$ (the latter is like a predicted/fitted value for the raw score for Examinee
$i$) for the 2LC model. We jitter the latter quantities to avoid having too many overlapping
points. The dots show the posterior means of the $R_i$s. The horizontal line in the middle of
the plot shows the 0-line, that is, the line for $R_i = 0$.




Figure 3. Plot of the posterior distributions of the number correct score residuals vs. 
the predicted number correct scores for the 2LC model. 

Figure 3 indicates a potential problem with the 2LC model. Note that towards the left
of the plot (i.e., for low estimated expected scores), the residuals are mostly distributed
below the 0-line. Towards the right side (i.e., for high estimated expected scores), the
residuals are mostly distributed above the 0-line. The plot thus suggests that the 2LC
model over-predicts the scores of the individuals who have low proficiency to solve mixed
number subtraction problems; on the other hand, the model under-predicts the scores of the
individuals who have high proficiency. Clearly this indicates that the 2LC model does not
explain the data adequately; it seems to pull the number correct score towards the middle.

Another approach to looking at overall fit is to plot the distribution of the residual
scores vs. some measure of overall proficiency. Intuitively, the more skills a participant has
mastered, the more items the participant should be able to solve correctly. The $\theta$-score of
an individual, $\sum_{k=1}^{5} \theta_{ik}$, represents the number of skills an examinee has mastered. Although
the values of $\theta_i$ are unobservable, we have imputed values $\theta_i^*$ from each cycle of
the MCMC loop. The posterior mean of these values over all the iterations (after an initial
burn-in) of the MCMC gives us a point estimate of the $\theta$-score of an examinee. We can use
this estimated $\theta$-score instead of the predicted score of the examinees to create a residual
plot. Figure 4 shows a residual plot for the 2LC model created using this method.




[Figure 4 appears here. Horizontal axis: Estimated theta score, 1 to 5.]



Figure 4. Plot of the posterior distributions of the number correct score residuals vs.
the estimated $\theta$-scores for the 2LC model.

Figure 4 indicates the same problem as seen in Figure 3; in particular, it indicates a lack
of fit. Participants who have mastered few skills are performing worse than predicted, and
participants who have mastered all of the skills are performing better than expected. This
may indicate a problem with the link models. Section 5 looks at some possible remedies.

A further refinement of the Bayesian residual plot described above may be achieved by
using $E(O_i \mid \pi, \lambda)$, obtained by taking the expectation of $E(O_i \mid \theta_i, \pi, \lambda)$ in (5) over $\theta_i$,
and examining residuals based on this refined expectation. The refined residuals would form
the basis of a more powerful model diagnostic tool. However, for our purpose, residuals using
$E(O_i \mid \theta_i, \pi, \lambda)$ allowed us to detect some problems with the 2LC model, and hence we do
not pursue those based on $E(O_i \mid \pi, \lambda)$.

Item Fit Plots 

Before discussing possible remedies, we first explore in detail the item fit diagnostics
introduced in Section 3. As mentioned before, the key idea is to divide the examinees
into different groups based on their proficiencies (so that different groups will have different
success probabilities for an item) and then compare the observed proportion-correct scores
against the predicted proportion-correct scores of the different groups.

The natural groups are the nine equivalence classes defined in Table 2. Equivalence
Class 1 represents no skills and Equivalence Class 9 represents all skills. The classes in
between are roughly ordered in order of increasing skills; however, the ordering is only
partial.

Though the true class membership for any examinee is unobservable, we can classify
that examinee on the basis of a set of imputed values of $\theta_i$ from the MCMC algorithm.
Yan et al. (2002) use groups based on a single iteration to search for item misfit. However,
using only one iteration from the MCMC ignores the posterior variability of the model
parameters and class assignments. We extend the plots to average over all the iterations of
the MCMC.

In each iteration of the MCMC algorithm, we calculate, for each examinee, a vector of
indicators of membership in each of the nine equivalence classes. Averaging these indicators
over cycles gives us a series of vectors $\tau_i = (\tau_{i1}, \ldots, \tau_{i9})$, where $\tau_{ik}$ represents the
proportion of iterations in which Examinee $i$ was assigned to Equivalence Class $k$. The
vector $\tau_i$ provides a probabilistic classification of the examinee into an equivalence class.

To determine the "observed" proportion correct for Item $j$ for examinees in Equivalence
Class $k$, we take the weighted average over examinees using the classification probabilities $\tau_{ik}$
as weights. Thus, the observed proportion correct for the $k$-th equivalence class and the
$j$-th item is given by

$$p_{kj} = \frac{\sum_i \tau_{ik} X_{ij}}{\sum_i \tau_{ik}}.$$


Note that $p_{kj}$ is not observed in the true sense of the term because we do not really observe,
but rather estimate, the $\tau_{ik}$s. The predicted proportion correct for the $k$-th equivalence
class and the $j$-th item is the posterior mean of $\pi_{j,\delta_{(k)(s)}}$ for the appropriate equivalence
class and item combination. Here, $\delta_{(k)(s)}$ is an indicator that tells whether examinees
in Equivalence Class $k$ have mastered the skills necessary to solve items from Evidence
Model $s$ (where $s$ depends on the Item $j$; see Table 1). For example, for examinees in
Equivalence Class 1, $\delta_{(1)(s)} = 0$ for all $s$, and hence the predicted proportion correct score
for any item will be the posterior mean of $\pi_{j0}$. If the model fits the data well, we can expect
the observed proportion correct to be close to the predicted proportion correct for all
combinations of equivalence classes and items.
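The classification probabilities and the weighted "observed" proportions can be sketched as below. The sketch assumes that class labels from each MCMC iteration are stored as integers starting at 0 (an indexing convention chosen here for convenience) and that every class receives positive total weight; the function names are hypothetical.

```python
import numpy as np

def class_probabilities(class_draws, n_classes):
    """class_draws: (n_iter, n_examinees) integer labels in 0..n_classes-1.
    Returns tau with shape (n_examinees, n_classes); each row sums to 1.
    """
    n_examinees = class_draws.shape[1]
    tau = np.zeros((n_examinees, n_classes))
    for k in range(n_classes):
        # Proportion of iterations in which each examinee was assigned to class k.
        tau[:, k] = (class_draws == k).mean(axis=0)
    return tau

def observed_proportions(tau, responses):
    """responses: (n_examinees, n_items) array of 0/1 item scores.
    Returns p with p[k, j] = sum_i tau_ik X_ij / sum_i tau_ik.
    """
    weights = tau.sum(axis=0)                      # total weight of each class
    return (tau.T @ responses) / weights[:, None]

# Hypothetical draws: 4 iterations, 3 examinees, 2 classes, 1 item.
class_draws = np.array([[0, 1, 1],
                        [0, 1, 1],
                        [0, 0, 1],
                        [1, 0, 1]])
responses = np.array([[0.0], [1.0], [1.0]])

tau = class_probabilities(class_draws, n_classes=2)
p_obs = observed_proportions(tau, responses)
```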

Comparing the $p_{kj}$s to the corresponding predicted values for all equivalence classes
should provide information about how well the link model for Item $j$ fits the data. Many
large deviations indicate a problem with the model. We use a rough confidence interval
based on a presumption of normality. Using the fact that the responses of different
examinees to an item are independent, we obtain



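$$V_{kj} = Var(p_{kj} \mid \theta, \pi) = \frac{\sum_i \tau_{ik}^2\, \pi_{j,\delta_{i(s)}} \left(1 - \pi_{j,\delta_{i(s)}}\right)}{\left(\sum_i \tau_{ik}\right)^2}.$$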




We estimate the $\pi$s in the above quantity by their posterior means to obtain an estimated
variance of $p_{kj}$. We make a normal approximation of the proportions to take $p_{kj} \pm 2\sqrt{V_{kj}}$
as a rough 95% confidence interval. Because there are only nine equivalence classes and
as many as 325 examinees, normal approximation of the proportion correct is not entirely
unreasonable.
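A small sketch of the rough interval follows. It treats the classification weights as fixed and uses the variance of a weighted average of independent Bernoulli responses, which is an assumption of this sketch; the function name `rough_interval` and the inputs are hypothetical.

```python
import numpy as np

def rough_interval(tau_k, responses_j, pi_i):
    """Rough 95% interval for the observed proportion correct of one class and item.

    tau_k:       (n_examinees,) weights tau_ik for Equivalence Class k.
    responses_j: (n_examinees,) 0/1 responses to Item j.
    pi_i:        (n_examinees,) success probability of each examinee on Item j,
                 for example the posterior mean of the applicable pi.
    """
    n_k = tau_k.sum()
    p_kj = (tau_k * responses_j).sum() / n_k
    # Variance of the weighted mean of independent Bernoulli variables.
    v_kj = (tau_k ** 2 * pi_i * (1 - pi_i)).sum() / n_k ** 2
    half_width = 2 * np.sqrt(v_kj)
    return p_kj, p_kj - half_width, p_kj + half_width

# Hypothetical class with four examinees assigned with weight 1 each.
tau_k = np.array([1.0, 1.0, 1.0, 1.0])
responses_j = np.array([1.0, 1.0, 0.0, 1.0])
pi_i = np.full(4, 0.5)
p_kj, low, high = rough_interval(tau_k, responses_j, pi_i)
```

With so little total weight the interval is very wide and can extend beyond the unit interval, which illustrates why small classes give weak evidence of misfit.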

Figures 5 and 6 show the item fit plots for the 15 items. The horizontal lines in each
plot are the posterior means of $\pi_{j0}$ and $\pi_{j1}$ for the items (which are the predicted proportions
correct for the equivalence classes). The vertical lines constitute the rough 95% confidence
intervals attached to the observed proportion correct for the examinees in each equivalence
class. The glyph in the middle of an interval represents the observed proportion correct for
that equivalence class. The glyph depends on the value of $\delta_{(k)(s)}$; it is an asterisk (*) for
equivalence classes that have mastered the skills necessary to solve the item, and it
is an X for equivalence classes lacking one or more skills. Confidence intervals having an
X at the center, but not covering the lower horizontal line, indicate possible misfit, as do
confidence intervals having an asterisk at the center, but not covering the upper horizontal
line.

Figure 5. Item fit plots for Items 1–8 in the 2LC model. (Horizontal axis: Equivalence Classes.)

Figure 6. Item fit plots for Items 9–15 in the 2LC model.

The total weight of Equivalence Class $k$, $N_k = \sum_i \tau_{ik}$, plays roughly the same role as
the sample size would if we could observe the classes. Because the total weight is quite small for
Classes 2 to 6 (ranging from about 2 to 11), a difference in the observed and predicted
proportions for those classes could be due to sampling variability. For the other equivalence
classes, such differences are more likely to imply a serious problem.

In Table 4, all the combinations of equivalence class and item for which there is a
discrepancy between the observed and predicted proportions are marked with a ✓. A
look at the table suggests that there is a problem with about half the items in the data
set. This indicates that the model cannot explain the data set adequately. Looking more
closely, we see that many of the problems occur for Equivalence Class 1 (people who have
not mastered any skills) and Equivalence Class 9 (people who have mastered all of the
skills). This is consistent with the findings from the residual plots.

A Test Statistic to Detect Item Fit 

To quantify the item fit plots discussed above, we define a test statistic. The quantity
$p_{kj}$ is the estimated observed proportion correct for the $j$-th item for participants in the
$k$-th equivalence class. So the observed number of examinees in the $k$-th equivalence class
getting Item $j$ correct, denoted $O_{kj}$, is given by

$$O_{kj} = p_{kj} \sum_i \tau_{ik} = \sum_i \tau_{ik} X_{ij}.$$

The predicted number of examinees in the $k$-th equivalence class getting Item $j$ correct,
denoted $E_{kj}$, is obtained by multiplying the predicted proportion correct score for that
combination (which is the posterior mean of the suitable $\pi_{j,\delta_{(k)(s)}}$) by $N_k = \sum_i \tau_{ik}$.

We now define the item fit statistic for the $j$-th item, $X_j^2$, as

$$X_j^2 = \sum_{k=1}^{9} \left[ \frac{(O_{kj} - E_{kj})^2}{E_{kj}} + \frac{\{(N_k - O_{kj}) - (N_k - E_{kj})\}^2}{N_k - E_{kj}} \right] = \sum_{k=1}^{9} \frac{N_k (O_{kj} - E_{kj})^2}{E_{kj}(N_k - E_{kj})}. \qquad (7)$$

Although the statistic is inspired by classical $\chi^2$ tests, the reference distribution is
unknown. If the membership of the examinees in the equivalence classes were known, the
statistic $X_j^2$ would follow a $\chi^2$ distribution with 7 d.f. (because there are nine equivalence
classes for any item, and we are estimating two $\pi$s for each item) under the null hypothesis
that the 2LC model fits the data set. But the membership is estimated, making the true null
distribution difficult to compute. The same issue arises with the statistics suggested by
Hambleton and Traub (1973) and Yen (1981) in the context of IRT models. However, even
though the null distribution of these statistics is unknown, we can compare the values of the $X_j^2$s
computed for a number of competing models by keeping the number of groups (equivalence
classes in this problem) constant to judge which model is preferable over the others (in the
same vein as suggested by Hambleton & Traub, 1973, for comparing statistics to detect
model misfit for IRT models).

Table 4.

Problematic Cases Detected by the Item Fit Plots

              Equivalence class
Item no.   1 2    3 4 5 6 7    8    9     Value of $X_j^2$
 1          ✓                             10.7
 2                                         2.2
 3          ✓         ✓                   15.0 (a)
 4                    ✓             ✓     39.5 (b)
 5          ✓                              7.6
 6                                         3.3
 7                                         2.7
 8                                         3.7
 9                    ✓                    8.5
10          ✓                  ✓          24.1 (b)
11                             ✓    ✓     10.6
12                                         2.8
13          ✓                             14.4 (a)
14                                         4.0
15                                         3.9

(a) Larger than the 95th percentile of the $\chi^2_7$ distribution (approximately 14).
(b) Larger than the 99th percentile of the $\chi^2_7$ distribution (approximately 18.5).

We use the $\chi^2_7$ distribution as a rough reference to provide a heuristic to flag items that
are not fit well by the model. The computed values of the $X_j^2$ statistics are
given in the last column of Table 4. Values greater than the 95th and 99th percentiles of the
$\chi^2_7$ distribution are flagged. Note that the $X_j^2$ statistics and item fit plots flag the same
items. We see low values of the $X_j^2$s for the items with no ✓ for them in the table (e.g., Items
2, 6, 7, 8), while the values are high for Items 3, 4, 10, and 13, for which we see a few ✓s (i.e.,
discrepant cases, as we discussed earlier) in the table.
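The statistic in (7) and the heuristic flagging rule can be sketched as follows. The function names are hypothetical, the toy example uses only two classes (so the 7 d.f. reference does not apply to it), and the percentile constants are the usual tabulated values for a $\chi^2$ distribution with 7 d.f.; the item values are those reported in the last column of Table 4.

```python
import numpy as np

def item_fit_statistic(tau, responses_j, predicted_j):
    """X_j^2 for one item, using the second form of equation (7).

    tau:         (n_examinees, n_classes) classification probabilities.
    responses_j: (n_examinees,) 0/1 responses to Item j.
    predicted_j: (n_classes,) predicted proportion correct for each class.
    """
    n_k = tau.sum(axis=0)          # N_k, total weight of each class
    o_kj = tau.T @ responses_j     # O_kj, weighted observed number correct
    e_kj = n_k * predicted_j       # E_kj, predicted number correct
    return float(np.sum(n_k * (o_kj - e_kj) ** 2 / (e_kj * (n_k - e_kj))))

# Toy check with two classes and hard assignments (illustration only).
tau = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
responses_j = np.array([0.0, 1.0, 1.0, 1.0])
predicted_j = np.array([0.5, 0.5])
x2_toy = item_fit_statistic(tau, responses_j, predicted_j)

# Tabulated 95th and 99th percentiles of the chi-square distribution with 7 d.f.
CHI2_7_P95 = 14.07
CHI2_7_P99 = 18.48

def flag(x2):
    """Return 'b' above the 99th percentile, 'a' above the 95th, '' otherwise."""
    if x2 > CHI2_7_P99:
        return "b"
    if x2 > CHI2_7_P95:
        return "a"
    return ""

# Values of X_j^2 for the 2LC model as reported in Table 4.
table4_values = {1: 10.7, 2: 2.2, 3: 15.0, 4: 39.5, 5: 7.6, 6: 3.3, 7: 2.7, 8: 3.7,
                 9: 8.5, 10: 24.1, 11: 10.6, 12: 2.8, 13: 14.4, 14: 4.0, 15: 3.9}
flags = {item: flag(value) for item, value in table4_values.items() if flag(value)}
```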

Posterior Predictive Model Checking 

The posterior predictive model check diagnostics are based on comparing observed data
$X_{ij}$ to replicated data, $X_{ij}^{rep}$. As each iteration of the MCMC algorithm generates values
for all of the parameters as well as the latent proficiency variables, the only additional
work involved in generating the replicates is that of generating $X_{ij}^{rep}$ from the model (3)
using the generated values. The diagnostic measures we use, $D(y, \omega)$, are the proportion
correct (for both items and examinees) and the mean squared Pearson residuals (taking the
averages across items, examinees, and both).

Consider $T_{ij}(X, \omega)$ given by

$$T_{ij}(X, \omega) = \frac{(X_{ij} - E(X_{ij}))^2}{V(X_{ij})} = \frac{(X_{ij} - \pi_{j,\delta_{i(s)}})^2}{\pi_{j,\delta_{i(s)}} (1 - \pi_{j,\delta_{i(s)}})},$$

where $\omega$ represents the parameters and latent variables. The quantity $T_{ij}(X, \omega)$ measures
the error involved in predicting the response of the $i$-th examinee to the $j$-th item and involves both
the data and parameters. Summing it over all items, $j$, produces a discrepancy measure

$$T_i^{person}(X, \omega) = \sum_j T_{ij}(X, \omega)$$

for Examinee $i$. Summing it over all examinees, $i$, produces a discrepancy measure

$$T_j^{item}(X, \omega) = \sum_i T_{ij}(X, \omega)$$

for Item $j$. Summing the $T_{ij}(X, \omega)$s over both items and participants produces an overall
discrepancy measure

$$T^{overall}(X, \omega) = \sum_j \sum_i T_{ij}(X, \omega).$$

These discrepancy measures resemble the classical $\chi^2$ goodness-of-fit measure. The
proportion of iterations in which the values of $T_i^{person}(X, \omega)$ from the actual data exceed
the values of $T_i^{person}(X^{rep}, \omega)$ is a person fit p-value for each person. Similarly, we
can get an item fit p-value for each item and an overall p-value.
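The discrepancy measures and the person fit p-values can be sketched as below. This is an illustration under assumptions, not the code used for the analysis: it presumes that the success probability of every examinee on every item is available for each saved MCMC iteration, and the function names and simulated inputs are hypothetical.

```python
import numpy as np

def pearson_discrepancies(x, prob):
    """x: (n_examinees, n_items) 0/1 responses.
    prob: same shape, success probabilities implied by one MCMC iteration.
    Returns the person-level, item-level, and overall sums of T_ij.
    """
    t = (x - prob) ** 2 / (prob * (1 - prob))
    return t.sum(axis=1), t.sum(axis=0), float(t.sum())

def person_fit_pvalues(x, prob_draws, rng):
    """prob_draws: (n_iter, n_examinees, n_items), one probability matrix per iteration.
    Returns, for each examinee, the proportion of iterations in which the
    realized discrepancy exceeds the replicated one.
    """
    exceed = np.zeros(x.shape[0])
    for prob in prob_draws:
        x_rep = rng.binomial(1, prob)                  # replicated responses
        t_obs, _, _ = pearson_discrepancies(x, prob)
        t_rep, _, _ = pearson_discrepancies(x_rep, prob)
        exceed += t_obs > t_rep
    return exceed / len(prob_draws)

# Small deterministic check: every probability is 0.5, so every T_ij equals 1.
x = np.array([[1, 0], [0, 0]])
person, item, overall = pearson_discrepancies(x, np.full((2, 2), 0.5))

# Hypothetical probability draws for 200 iterations.
rng = np.random.default_rng(2)
prob_draws = rng.uniform(0.2, 0.8, size=(200, 2, 2))
p_values = person_fit_pvalues(x, prob_draws, rng)
```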

To look at some discrepancy measures that do not depend on the estimated parameters
and latent variables, we also calculated the proportion correct scores for both the actual
and replicated data. $Prop_j^{item}(X)$ is the proportion correct for Item $j$, and $Prop_i^{person}(X)$
is the proportion correct for participant $i$.

The item fit p-values do not indicate any misfit, with p-values lying between 0.23 and
0.73. The overall fit p-value is 0.35, which does not provide evidence against the model.
Thus the posterior predictive model checks fail to detect the problems noted with the
earlier diagnostics; however, the posterior predictive model checking method is known to be
conservative. Alternatively, the discrepancy measures used may not have been effective.

The person fit p-values flag a number of persons with unusual response patterns. Using
the measure $T_i^{person}(X, \omega)$, we observe extreme p-values (more than 0.95 or less than 0.05)
for Examinees 36, 75, 77, 101, 113, 116, 137, 179, 272, 298, 315, and 319. Looking at the
response patterns of these examinees, we find that their response patterns are unusual. For
example, Examinee 315, who has the most extreme p-value of 0.996, gets only Items 8,
11, 12, 13, and 15 correct. Looking back at Table 1, we find this is quite unusual. This
examinee gets wrong the only two items (2 and 4) in Evidence Model 1 (requiring Skill 1
only). This suggests that he should get all 15 items wrong because all of them require
Skill 1. However, he gets Item 8 (requiring Skills 1 and 2) correct. He also gets 3 out of 5
items correct in Evidence Model 4 (requiring Skills 1, 3, and 4), but gets only 1 correct out
of the 3 in Evidence Model 3 (requiring Skills 1 and 3), which requires a subset of the skills
needed for Evidence Model 4. As a result, the model cannot explain well the responses of
Examinee 315, resulting in an extreme p-value.

For $Prop_i^{person}(X)$, extreme p-values were obtained for Examinees 40, 41, 42, 44, 45,
50, 52, 53, 79, 82, 96, 113, 116, 136, 137, 148, 161, 178, and 288. Again, these examinees
have unusual response patterns under the model. For example, Examinee 40, with a p-value
of 0.998, gets Items 1, 6, 7, 12, 14, and 15 wrong. That means he gets 4 out of 5 items in
Evidence Model 4 wrong. Since Evidence Model 5 requires a set of skills that includes all
those required for Evidence Model 4, we would expect that he would get all three items in
Evidence Model 5 wrong under the model; but he gets two of those correct. Naturally, the
model underestimates the proportion correct for this examinee and we see a large p-value.

Notice that only three examinees are flagged as unusual by both the measures
considered here, indicating that they measure different types of discrepancies and that both
of them may be useful. One of the three common examinees, Examinee 113, gets Items 5,
7, 8, 10, 12, and 15 wrong. The examinee gets Item 8 wrong (the only one in Evidence
Model 2, requiring Skills 1 and 2), indicating that he probably does not have Skill 2. He
gets 3 out of 5 wrong in Evidence Model 4 (requiring Skills 1, 3, and 4), indicating that
he probably does not have Skill 4. From these observations, we would expect him to get
Item 6 wrong under the model as this item requires Skills 1, 2, 3, and 4. However, he gets
the item correct.

Since not many examinees are found to have extreme p-values (as we are using a 10%
level test, we expect about 32.5 examinees to be flagged, compared to the 28 flagged by
the two person-fit diagnostics combined), the person-fit indices do not indicate any major failure of the
model. They do flag some response patterns that are unusual under the model, but they
may suffer from the same lack of power that we saw with the item fit tests. Furthermore,
the test length, 15 items, is rather small, which would also contribute to a lack of power.
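The counts quoted above can be checked directly from the two lists of flagged examinees reported in this section. The variable names below are illustrative only.

```python
n_examinees = 325
expected_flagged = 0.10 * n_examinees  # expected number flagged by a 10% level test

# Examinees with extreme p-values under the Pearson-residual person measure.
t_person_flags = {36, 75, 77, 101, 113, 116, 137, 179, 272, 298, 315, 319}

# Examinees with extreme p-values under the proportion-correct person measure.
prop_person_flags = {40, 41, 42, 44, 45, 50, 52, 53, 79, 82, 96, 113, 116,
                     136, 137, 148, 161, 178, 288}

flagged_by_both = t_person_flags & prop_person_flags
flagged_by_either = t_person_flags | prop_person_flags
```

Here `flagged_by_both` contains Examinees 113, 116, and 137, and `flagged_by_either` has 12 + 19 - 3 = 28 members.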

Limitations of the 2LC model 

The overall fit plots clearly indicate that the 2LC model cannot explain the data
satisfactorily. The item fit plots give us a clue as to what might be happening: the model
seems to predict Equivalence Classes 1 and 9 unsatisfactorily.

One limitation of the 2LC model is that it uses an all-or-nothing approach to explain
the probability of a correct outcome from an item (i.e., it divides the examinees into two
groups based on whether or not they have mastered all the necessary skills for solving an
item and assigns the same probabilities to all equivalence classes within a group). In reality,
that may not be the case; it may be easier to compensate for the lack of one skill than
for the lack of many. There also is no latent variable representing overall mathematical
proficiency. It is possible that such a skill would be related to both how quickly a student
could master the skills and how readily the student could apply them to a given
problem.



5. Three Revised Models 

Increasing the number of parameters in the link models will soften the all-or-nothing
nature of the 2LC link model. By partitioning the equivalence classes into two groups, the
2LC model only fits two values, $\pi_{j0}$ and $\pi_{j1}$, for each item. Expanding the effective number of
groups has an appropriate softening effect.

To form one new model, we divide the examinees not having the necessary skills to
solve an item into two groups:

• those who have not mastered Skill 1
• those who have mastered Skill 1, but still lack one or more additional skills necessary
for solving the item

There is also the group who have the necessary skills for an item. We assign different
success probabilities to each of the three groups. Mathematically, the proficiency model
remains the same as the 2LC model, but the link model for Examinee $i$ and Item $j$ (that
uses Evidence Model $s$) is now:

$$X_{ij} \mid \pi_{jm}, (\delta_{i(s)} + I_{i1} = m) \sim \text{Bern}(\pi_{jm}), \quad \text{for } m = 0, 1, 2, \qquad (8)$$

where $\delta_{i(s)}$ is the indicator of whether Examinee $i$ has mastered all the skills
required for solving items using Evidence Model $s$, and $I_{i1}$ is the indicator function denoting
whether Examinee $i$ has Skill 1. This model is called the "3LC" model as it has three
parameters, $\pi_{j0}$, $\pi_{j1}$, and $\pi_{j2}$, for each item.

Treating the students who lack Skill 1 specially makes sense from both an empirical
and a cognitive perspective. Empirically, Table 4 flagged Equivalence Class 1 (lacking
Skill 1) more often than any other. Cognitively, according to our model, Skill 1 is
a prerequisite for all of the others. Students who have yet to master Skill 1 are probably
struggling with the very basics of fraction subtraction, and it makes sense that they would
be less readily able to solve any problem. Consequently, we use a prior distribution with a
lower mean for $\pi_{j0}$, a Beta(3.5, 23.5), than we use for $\pi_{j1}$, a Beta(6, 21). The prior for true positives,
$\pi_{j2}$, remains a Beta(23.5, 3.5). Note that Items 2 and 4 only require Skill 1. Therefore, we
set $\pi_{j0} = \pi_{j1}$ for those items.

Extension of the 3LC model by adding another $\pi$ for each item produces the 4LC
model. We divide the group of examinees having all the necessary skills to get an item
correct into two subgroups: those who have mastered all five skills and those who are yet
to master one or more skills. We assign different success probabilities to each of these two
subgroups. The proficiency model is the same as the 2LC and 3LC models, but the link
model is now:

$$X_{ij} \mid \pi_{jm}, (\delta_{i(s)} + I_{i1} + I_{i,\text{all}} = m) \sim \text{Bern}(\pi_{jm}), \quad \text{for } m = 0, 1, 2, 3,$$

where $I_{i1}$ (the indicator that Examinee $i$ has Skill 1) is as defined earlier and $I_{i,\text{all}}$ is the
indicator function denoting whether Examinee $i$ has all of the five skills. The prior
distributions are

$\pi_{j0} \sim$ Beta(2, 25)
$\pi_{j1} \sim$ Beta(6, 21)
$\pi_{j2} \sim$ Beta(21, 6)
$\pi_{j3} \sim$ Beta(25, 2)

This model is expected to explain the performance of the examinees with low and high
proficiencies better than the 2LC model.

Yan et al. (2003) consider another extension of the 2LC model, which introduces a
new latent variable $\eta_i$ that represents the examinee's propensity to solve problems with
or without the requisite skills. In this model, the success probability of Examinee $i$ for
Item $j$ (Item $j$ uses the Evidence Model $s$, and $\delta_{i(s)}$ is the indicator denoting whether
Examinee $i$ has mastered the skills needed for items using Evidence Model $s$) is:

$$P(X_{ij} = 1) = \frac{\exp\left(\text{logit}(\pi_{j,\delta_{i(s)}}) + S_j \eta_i\right)}{1 + \exp\left(\text{logit}(\pi_{j,\delta_{i(s)}}) + S_j \eta_i\right)},$$

where $\text{logit}(x) = \log \frac{x}{1-x}$ and $S_j$ is an item slope parameter for Item $j$. We refer to this as the
"2LC+$\eta$" model. The prior distribution assumed for the $\eta_i$s is independent $N(0, 1)$ and that
for the $S_j$s is independent Y(−2, 0.5).
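The success probability of this model shifts the 2LC probability on the logit scale. The sketch below simply evaluates that formula; the function names and the numeric inputs are hypothetical.

```python
import math

def logit(x):
    """Log-odds of a probability x in (0, 1)."""
    return math.log(x / (1 - x))

def success_probability(pi, slope, eta):
    """Inverse logit of logit(pi) + slope * eta.

    pi:    2LC success probability for the examinee's mastery state on the item.
    slope: item slope parameter.
    eta:   examinee propensity.
    """
    z = logit(pi) + slope * eta
    return 1 / (1 + math.exp(-z))   # equal to exp(z) / (1 + exp(z))

p_neutral = success_probability(0.8, 1.0, 0.0)   # eta = 0 leaves pi unchanged
p_higher = success_probability(0.8, 1.0, 1.0)    # positive slope * eta raises it
p_lower = success_probability(0.8, 1.0, -1.0)    # negative slope * eta lowers it
```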

The results of the diagnostics for the three expanded models are discussed below.

Bayesian Residual Plots

We create the residual plots for these models in the same way as discussed in Section 4.
Figure 7 shows a plot for the posterior distribution of the number correct score residuals
vs. the jittered predicted score estimates for the four models.

As in Section 4, we also create residual plots using the estimated $\theta$-score of an examinee
instead of the predicted number correct score. Figure 8 shows residual plots for all four
models using this method. We draw a horizontal line at y-coordinate 0 in these residual
plots for ease of viewing.

Figure 7. Plot of the posterior distributions of the number correct score residuals vs.
the predicted number correct scores for the four models. (Horizontal axis of each panel:
Predicted Number Correct Score.)

Figure 8. Plot of the posterior distributions of the number correct score residuals vs.
the estimated $\theta$-scores for the four models. (Horizontal axis of each panel: Average Theta
Score, 1 to 5.)

Figure 9. Item fit plots for Items 1–8 in the 3LC model. (Horizontal axis: Equivalence Classes.)

A look at the two sets of plots suggests that the 3LC and 4LC models do not have
the same problem of bias at the ends of the proficiency scale as the 2LC model does.
For these two models, the posterior distributions of the residuals have, on average, roughly
the same concentration above and below the 0-line for both low-proficiency (low
estimated $\theta$-score/predicted score) and high-proficiency examinees. The "2LC+$\eta$" model,
even with all the extra parameters (which result in a long run-time of the BUGS program
fitting this model), does not appear to do a very good job. It still has the same problem of
over-prediction (like the 2LC model) at the low end of proficiency. Also, there seems to be
a linear pattern with a negative slope in the residual plot of this model.

Item Fit Plots 

We create item fit plots for the 3LC model (shown in Figures 9 and 10) in the same way 
as we created the plots for the 2LC model. The computation of the observed proportions 
correct and the rough 95% confidence intervals attached to them remains exactly the same 
as that for the 2LC model. The predicted proportions correct to which we compare these 
observed proportions are computed in a similar way as with the 2LC model, 
except that now there are three $\pi$s for each item. There are three horizontal lines in the 
plot for each item now, one for each $\pi$ for that item. The line for $\pi_{j0}$ is dashed, that for $\pi_{j1}$ 
is dashed and bold, and the line for $\pi_{j2}$ is solid. A hollow circle, o, in the middle of an 
interval (marked by vertical lines) indicates that the members in that equivalence class do 
not have Skill 1 (this interval should contain the horizontal line for $\pi_{j0}$). An x indicates 
that the members in that equivalence class have Skill 1, but not all the necessary skills (this 
interval should contain the horizontal line for $\pi_{j1}$). An asterisk, *, means the members 
in that equivalence class have the necessary skills to solve that item (this interval should 
contain the horizontal line for $\pi_{j2}$). Note that Items 2 and 4 have only two $\pi$s each because 
they require only Skill 1, so $\pi_{j2} = \pi_{j1}$. 
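
The comparison that each item fit plot makes visually can be sketched in a few lines. The sketch below assumes a normal-approximation interval of plus or minus two standard errors around the observed proportion correct, which may differ from the exact computation used for the 2LC model; the function names and all counts and predicted values are hypothetical.

```python
import math


def proportion_interval(n_correct, n):
    """Observed proportion correct with a rough 95% interval
    (p +/- 2 standard errors, clipped to [0, 1])."""
    p = n_correct / n
    half_width = 2.0 * math.sqrt(p * (1.0 - p) / n)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


def covers(n_correct, n, predicted):
    """True if the rough interval contains the predicted proportion."""
    _, lower, upper = proportion_interval(n_correct, n)
    return lower <= predicted <= upper


# Hypothetical equivalence classes for one item:
# (label, number correct, class size, index of the relevant pi).
classes = [("EC1", 5, 40, 0), ("EC2", 18, 30, 1), ("EC3", 44, 50, 2)]
# Hypothetical posterior means of pi_j0, pi_j1, pi_j2.
predicted_pi = [0.10, 0.55, 0.90]

for label, n_correct, n, k in classes:
    print(label, proportion_interval(n_correct, n), covers(n_correct, n, predicted_pi[k]))
```

An equivalence class whose interval fails to contain the relevant horizontal line is the kind of discrepancy the plots are meant to expose.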

Figures 11 and 12 show the item fit plots for the 4LC model. Here, a hollow circle, o, 
an x, and an asterisk, *, in the middle of an interval mean the same as in the 3LC model. 
Additionally, here we use a solid circle, •, to indicate that the members in that equivalence 
class have all of Skills 1-5. The four horizontal lines for $\pi_{j0}$, $\pi_{j1}$, $\pi_{j2}$, and $\pi_{j3}$ are dashed, 
bold dashed, solid, and bold solid, respectively. Again, Items 2 and 4 require Skill 1 only 
and have three $\pi$s each ($\pi_{j1} = \pi_{j2}$). 

For the 3LC model, there is a discrepancy between the observed and predicted counts 
for Item 4 (Equivalence Classes 6, 7, and 9) and Item 5 (Equivalence Class 1). Item 9 also 
has a discrepancy with Equivalence Class 6, but as with the 2LC model, the effective sample 
size in Equivalence Class 6 is small, so this could be just chance fluctuation. Items 1, 3, 10, 
11, and 12, for which misfit was detected for the 2LC model, seem to be consistent with the 
3LC model. Hence, the 3LC model improves upon the 2LC model and seems satisfactory 
for the data. 

Item fit plots for the 4LC model suggest that there is still some discrepancy with Items 4 
and 5 (Equivalence Classes 8 and 1, respectively). For both the 3LC and 4LC models, the 
posterior mean of $\pi_{j0}$ is higher than that of $\pi_{j1}$ for these items. This is counterintuitive 
because $\pi_{j0}$ is the success probability of the examinees who have not mastered any skills, 
while $\pi_{j1}$ is the success probability of the examinees who have just mastered Skill 1. 

Items 4 and 5 both have unusual forms that might indicate possible problems with the 
items. Item 4 subtracts a fraction from itself. A student can solve this item without using 
method A or B (Section 2), by observing that the two quantities are the same, and hence their 
difference should be 0. Item 5 subtracts the integer 2 from a mixed number whose integer 
part is 3. This can also be solved without using method A or B, by using knowledge of integer 
subtraction and by noticing that an integer plus a fraction minus another integer is the 
difference between the integers plus that fraction. As both of these items admit 
solutions that do not use the mixed number subtraction algorithms being taught, they may 
be inappropriate for this assessment. 

Test Statistics 

Table 5 shows the values of the item fit statistics, given by (7), for the 2LC, 3LC, and 
4LC models. The reference distribution is the $\chi^2$ distribution with six degrees of 
freedom for the 3LC model (as three parameters are estimated for each item) and with five 
degrees of freedom for the 4LC model (as four parameters are estimated for each item). 

While the 2LC model has four items flagged as problematic, only Item 4 is 
flagged for the 3LC and 4LC models. Note that the problematic Item 5 is not flagged for 
these models. 

Figure 11. Item fit plots for Items 1-8 in the 4LC model. 

Table 5. 
Values of Item Fit Statistics for the 2LC, 3LC, and 4LC Models 

Item no.    2LC model    3LC model    4LC model 
    1         10.7          7.6          2.5 
    2          2.2          2.0          1.9 
    3         15.0^a        5.2          3.9 
    4         39.5^a       39.3^b       13.1^c 
    5          7.6          8.6          8.8 
    6          3.3          5.1          4.3 
    7          2.7          3.4          1.4 
    8          3.7          2.3          3.3 
    9          8.5          5.6          3.5 
   10         24.1^a        6.1          5.3 
   11         10.6          5.2          2.8 
   12          2.8          4.3          2.7 
   13         14.4^a        4.1          2.5 
   14          4.0          5.1          3.6 
   15          3.9          7.9          5.6 

a Larger than the 95th percentile of $\chi^2_7$, 14. 
b Larger than the 95th percentile of $\chi^2_6$, 12.5. 
c Larger than the 95th percentile of $\chi^2_5$, 11. 
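
The flagging rule behind the footnotes of Table 5 can be reproduced with a short sketch. It assumes that each statistic is compared with the 95th percentile of a $\chi^2$ distribution with 7, 6, and 5 degrees of freedom for the 2LC, 3LC, and 4LC models, as the footnotes indicate. The helper functions `chi2_cdf` and `chi2_quantile` are illustrative implementations written for this sketch, not part of the original analysis.

```python
import math


def chi2_cdf(x, df):
    """Chi-square CDF, using the power series for the regularized
    lower incomplete gamma function P(df/2, x/2)."""
    if x <= 0.0:
        return 0.0
    a, t = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-15 * total:
        term *= t / (a + n)
        total += term
        n += 1
    return total * math.exp(-t + a * math.log(t) - math.lgamma(a))


def chi2_quantile(p, df):
    """Quantile of the chi-square distribution by bisection on the CDF."""
    lo, hi = 0.0, 1.0
    while chi2_cdf(hi, df) < p:
        hi *= 2.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if chi2_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Item fit statistics from Table 5: item -> (2LC, 3LC, 4LC).
TABLE5 = {
    1: (10.7, 7.6, 2.5), 2: (2.2, 2.0, 1.9), 3: (15.0, 5.2, 3.9),
    4: (39.5, 39.3, 13.1), 5: (7.6, 8.6, 8.8), 6: (3.3, 5.1, 4.3),
    7: (2.7, 3.4, 1.4), 8: (3.7, 2.3, 3.3), 9: (8.5, 5.6, 3.5),
    10: (24.1, 6.1, 5.3), 11: (10.6, 5.2, 2.8), 12: (2.8, 4.3, 2.7),
    13: (14.4, 4.1, 2.5), 14: (4.0, 5.1, 3.6), 15: (3.9, 7.9, 5.6),
}
# Degrees of freedom taken from the table footnotes.
DF = {"2LC": 7, "3LC": 6, "4LC": 5}

cutoffs = {model: chi2_quantile(0.95, df) for model, df in DF.items()}
flagged = {
    model: [item for item, values in TABLE5.items() if values[k] > cutoffs[model]]
    for k, model in enumerate(DF)
}
for model in DF:
    print(model, round(cutoffs[model], 2), flagged[model])
```

Because the true reference distribution of the statistic is not established, these cutoffs serve only as a heuristic screen.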

Posterior Predictive Model Checking 

Application of the posterior predictive model checking methods to the 3LC and 4LC 
models yields results similar to those for the 2LC model. The measures do not indicate any 
item misfit or overall model misfit, but flag a number of persons for possible misfit. However, 
the number of persons flagged is about half of that with the 2LC model, implying that 
these two models perform much better than the 2LC model in describing the response 
patterns of the examinees. The examinees whose responses are still not fit well are 36, 40, 
42, 44, 45, 75, 113, 137, 148, 161, 179, 288, 315, and 319. Note that this list includes the 
three examinees (with unusual responses) discussed earlier in Section 4. 

Choosing a Model 

The statistical diagnostic tools considered here indicate that the 3LC model and the 
4LC model seem to explain the data adequately. They are significant improvements over 
the 2LC model in explaining the overall fit and contain fewer items that do not fit. Between 
these two models, the 3LC model is preferable because it uses fewer parameters than the 
4LC model without a noticeable lack of overall fit. The introduction of the parameter $\eta$ in 
the “2LC+$\eta$” model does not seem to offer a significant improvement. 

6. Evaluation of the Diagnostics 

The set of diagnostic measures we proposed allowed us to uncover a problem with 
the 2LC model, characterize the nature of that problem, and come up with an alternative 
model, the 3LC model, which seems to fit the data better. Even though some practical 
issues remain (in particular, determining appropriate reference distributions for test 
statistics), the diagnostics still have some value in practical applications. 

The Bayesian residual plots are simple, but may be powerful tools for detecting model 
misfit. They clearly detect the misfit of the 2LC model and even help us characterize the 
problem; specifically, the plots suggest misfit at the two ends of the proficiency scale. This 
knowledge leads to the suggestion of three possible improved models. Two of these three 
models (3LC model and 4LC model) seem to correct the problem. A refined and more 
powerful version of the Bayesian residual plot is also suggested, but not pursued in this work. 
Although there is a recent surge of Bayesian statistical analysis in psychometrics, there 
have not been many applications of Bayesian residual analysis in the field, and this paper 
addresses that issue partially. 

The item fit plots seem quite promising as well in detecting item misfits and overall 
model misfits. They provide more details about the lack of fit than do the residual plots. 
In particular, they help isolate the problem to Equivalence Class 1, which suggests the 3LC 
model as a good remedy. When applied to the 3LC model and the 4LC model, these plots 
suggest flagging two items that cannot be explained well by the cognitive theory 
(Items 4 and 5). These items admit alternate solution paths and hence may be inappropriate 
for this examination. 

The test statistic corresponding to the item fit plots seems to have some power to 
detect problematic items. Consequently, some of the graphical tests could be automated, 
looking at the item fit plots only for items with high values of the statistic. Unfortunately, the 
reference distribution for these statistics is still unknown. Although the $\chi^2$ statistic seems 
to provide a good heuristic, the true level and power of the test are still unknown. Perhaps a 
simulation study would provide a more useful reference value (Williamson et al., 2000, use 
simulations with other diagnostic measures in the context of these models). 
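
One way such a simulation study could be organized is sketched below. It rests on strong simplifying assumptions that do not hold in this paper: equivalence class membership is treated as known rather than latent, the statistic is a plain Pearson-type sum with pooled estimates of two success probabilities, and the class sizes and probabilities are hypothetical. It therefore illustrates the mechanics of building an empirical reference value, not the behavior of the statistic in (7).

```python
import random


def fit_statistic(correct, sizes, mastery):
    """Pearson-type statistic over equivalence classes, with one success
    probability per mastery group estimated by pooling (assumed form)."""
    pooled = {}
    for group in (0, 1):
        n = sum(s for s, g in zip(sizes, mastery) if g == group)
        c = sum(x for x, g in zip(correct, mastery) if g == group)
        # Keep the estimate away from 0 and 1 to avoid division by zero.
        pooled[group] = min(max(c / n, 1e-6), 1.0 - 1e-6)
    return sum(
        (x - s * pooled[g]) ** 2 / (s * pooled[g] * (1.0 - pooled[g]))
        for x, s, g in zip(correct, sizes, mastery)
    )


def simulate_reference(sizes, mastery, pi, n_rep=2000, seed=1):
    """Empirical 95th percentile of the statistic under the fitted model."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_rep):
        correct = [
            sum(rng.random() < pi[g] for _ in range(s))
            for s, g in zip(sizes, mastery)
        ]
        stats.append(fit_statistic(correct, sizes, mastery))
    stats.sort()
    return stats[int(0.95 * n_rep)]


# Hypothetical design: nine equivalence classes, two mastery groups.
SIZES = [40, 30, 25, 35, 45, 20, 30, 50, 50]
MASTERY = [0, 0, 0, 0, 1, 1, 1, 1, 1]
PI = {0: 0.2, 1: 0.8}

print(round(simulate_reference(SIZES, MASTERY, PI), 2))
```

A study for the models considered here would also have to refit the model to each simulated data set and recompute the estimated class memberships, which is what makes the reference distribution hard to derive analytically.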

The posterior predictive model checking method does not indicate any lack of fit of 
the overall model or the items. This lack of power possibly indicates that we have not yet 
stumbled upon a powerful discrepancy measure to use with this method. The posterior 
predictive model checking method does flag a number of examinees, whose response 
patterns indeed appear unusual in the context of the problem. 

Although the residual plots and item fit plots proved useful in this problem, they 
are still conservative diagnostics because both of them use “observed values,” which 
depend on parameters estimated from the data. Consequently, although they will detect 
extreme model misfit, they may miss subtler problems. Furthermore, the lack of a reference 
distribution for the $\chi^2$-type statistic means that its power and level are still unknown. 

However, the most powerful argument for this method is that it can point to possible 
improvements in the underlying cognitive theory. The weakness in the 2LC model was 
a result of the strict conjunctive model for the applications of skills to items. The data 
clearly show us that individuals who lacked all of the requisite skills behaved differently 
from those who only lacked some. This finding may lead to better understanding of how 
to structure learning situations for students learning mixed number subtraction. Applying 
the diagnostics to similar situations can provide us the information to consider revising the 
cognitive theory, the statistical model, or the way data is collected and interpreted. 